ABSTRACT

With the huge amount of information available online, the
World Wide Web is a fertile area for data mining research.
Web mining research lies at the crossroads of several
research communities, such as database, information
retrieval, and, within AI, especially the sub-areas of
machine learning and natural language processing. However,
there is a lot of confusion when comparing research efforts
from different points of view. In this paper, we survey the
research in the area of Web mining, point out some
confusion regarding the usage of the term Web mining, and
suggest three Web mining categories. We then situate some
of the research with respect to these three categories. We
also explore the connection between the Web mining
categories and the related agent paradigm. For the survey,
we use as criteria the representation issues, the process,
the learning algorithm, and the application of recent work.
We conclude the paper with some research issues.
1. INTRODUCTION

The World Wide Web (Web) is a popular and interactive
medium for disseminating information today. The Web is
huge, diverse, and dynamic, and thus raises scalability,
multimedia data, and temporal issues, respectively. As a
result, we are currently drowning in information and
facing information overload B4|. Information users could
encounter, among others, the following problems when
interacting with the Web:

a. Finding relevant information: People either browse or
use a search service when they want to find specific
information on the Web. A user of a search service usually
inputs a simple keyword query, and the query response is a
list of pages ranked by their similarity to the query.
However, today's search tools have the following problems
|^] . The first problem is low precision, which is due to
the irrelevance of many of the search results. This makes
it difficult to find the relevant information. The second
problem is low recall, which is due to the inability to
index all the information available on the Web. This makes
it difficult to find unindexed information that is
relevant. See [80] for some other search engine problems.

b. Creating new knowledge out of the information available 
on the Web: Actually this problem could be regarded as a 
sub-problem of the problem above. While the problem above 
is usually a query-triggered process (retrieval oriented) , this 
problem is a data-triggered process that presumes that we 
already have a collection of Web data and we want to ex- 
tract potentially useful knowledge out of it (data mining 
oriented). Recent research g3|; [29] focuses on utilizing
the Web as a knowledge base for decision making. 

c. Personalization of the information: This problem is often 
associated with the type and presentation of information, 
since it is likely that people differ in the contents and pre- 
sentations they prefer while interacting with the Web. 
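
The low precision and low recall problems described under item a can be made concrete with a small computation. The following is a minimal sketch, assuming that the set of truly relevant pages is known for a query; the function name and the page identifiers are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: page ids returned by the search tool
    relevant:  page ids that are actually relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of the returned pages that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant pages that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 pages are returned and 2 of them are relevant,
# while 5 relevant pages exist (3 of them are not indexed).
p, r = precision_recall(
    ["p1", "p2", "p3", "p4"],
    ["p1", "p3", "p7", "p8", "p9"],
)
```

In this toy case half of the returned pages are irrelevant (low precision) and the three unindexed relevant pages can never be returned (low recall).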

On the other hand, the information providers could en- 
counter these problems, among others, when trying to achieve 
their goals on the Web: 

d. Learning about consumers or individual users: This
problem is closely related to problem c above and is about
knowing what the customers do and want. It includes
sub-problems such as mass customizing the information for
the intended consumers or even personalizing it for an
individual user, problems related to effective Web site
design and management, problems related to marketing, etc.

Web mining techniques could be used to solve the informa- 
tion overload problems above directly or indirectly. How- 
ever, we do not claim that Web mining techniques are the 
only tools to solve those problems. Other techniques and 
works from different research areas, such as database (DB), 
information retrieval (IR), natural language processing (NLP), 
and the Web document community, could also be used. By 
the direct approach we mean that the application of the Web
mining techniques directly addresses the above problems; an
example is a newsgroup agent that classifies whether a news
item is relevant to the user. By the indirect approach we
mean that the Web mining techniques are used as a part of a
bigger application that addresses the above problems; for
example, Web mining techniques could be used to create
index terms for Web search services.

Web mining research is a converging research area from
several research communities, such as the database, IR, and
AI research communities, especially from machine learning
and NLP. This paper is an attempt to present the research
done in a more structured way from the machine learning
point of view. However, the methods of the research that we
survey do not necessarily use well-known machine learning
algorithms. Since this is a huge, interdisciplinary, and
very dynamic research area, there are undoubtedly some
omissions in our coverage.

This paper is structured as follows. In section 2 we give
an overview of Web mining, describe some confusion in the
usage of the term Web mining, provide a classification, and
relate this classification to the agent paradigm. In
sections 3, 4, and 5 we describe some research that
represents the range of the research in the respective
categories. In section 6 we discuss some related work, and
finally we conclude in section 7.

2. WEB MINING 
2.1 Overview 

Web mining is the use of data mining techniques to
automatically discover and extract information from Web
documents and services ^(J. This area of research is so huge
today partly due to the interests of various research
communities, the tremendous growth of information sources
available on the Web, and the recent interest in e-commerce.
This phenomenon partly creates confusion when we ask what
constitutes Web mining and when we compare research in this
area. Similar to Etzioni fic||, we suggest decomposing Web
mining into these subtasks, namely:

1. Resource finding: the task of retrieving intended Web 
documents. 

2. Information selection and pre-processing: automati- 
cally selecting and pre-processing specific information 
from retrieved Web resources. 

3. Generalization: automatically discovering general
patterns at individual Web sites as well as across multiple
sites.

4. Analysis: validation and/or interpretation of the mined 
patterns. 
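
The four subtasks can be read as a pipeline. The following is a minimal sketch under strong simplifying assumptions: the "Web" is a small in-memory list of HTML strings, resource finding is a keyword match, and generalization is reduced to finding terms shared by several documents. All function names, the stop word list, and the toy corpus are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}  # illustrative only


def find_resources(corpus, keyword):
    """Step 1 (resource finding): keep documents that mention the keyword."""
    return [doc for doc in corpus if keyword.lower() in doc.lower()]


def preprocess(doc):
    """Step 2 (selection and pre-processing): strip tags, tokenize, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", doc)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def generalize(token_lists, min_docs=2):
    """Step 3 (generalization): terms that occur in at least min_docs documents."""
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: n for term, n in doc_freq.items() if n >= min_docs}


def analyze(patterns):
    """Step 4 (analysis): rank the patterns so that a person can inspect them."""
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))


corpus = [
    "<p>Web mining is the use of data mining on the Web</p>",
    "<p>Data mining and machine learning</p>",
    "<p>A page about cooking</p>",
]
docs = find_resources(corpus, "mining")
patterns = analyze(generalize([preprocess(d) for d in docs]))
```

Skipping `preprocess` and passing raw tokens to `generalize` would correspond to the shorter process 1 - 3 - 4.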

By resource finding we mean the process of retrieving the
data, either online or offline, from the text sources
available on the Web such as electronic newsletters,
electronic newswire, newsgroups, and the text contents of
HTML documents obtained by removing HTML tags, as well as
the manual selection of Web resources. We also include text
sources that originally were not accessible from the Web
but are accessible now, such as online texts made for
research purposes only, text databases, etc. The information
selection and pre-processing step is any kind of
transformation of the original data retrieved in the IR
process. These transformations could be either a kind of
pre-processing such as removing stop words, stemming, etc.,
or a pre-processing aimed at obtaining the desired
representation, such as finding phrases in the training
corpus, transforming the representation to relational or
first order logic form, etc. In step 3 above, machine
learning or data mining techniques are typically used for
the generalization. We should also note that humans play an
important role in the information or knowledge discovery
process on the Web since the Web is an interactive medium.
This is especially important for validation and/or
interpretation in step 4. Thus, interactive query-triggered
knowledge discovery is as important as the more automatic
data-triggered knowledge discovery. However, we exclude
knowledge discovery done manually by humans. As we will see
later in section 3, the process 1 - 3 - 4 is also used.
Thus, Web mining refers to the overall process of
discovering potentially useful and previously unknown
information or knowledge from the Web data. It implicitly
covers the standard process of knowledge discovery in
databases (KDD). We could simply view Web mining as an
extension of KDD that is applied to the Web data. From the
KDD point of view, the terms information and knowledge are
interchangeable [^3). There is a close relationship between
data mining, machine learning, and advanced data analysis
|8S| | . However, throughout the paper, we discuss the Web
mining research where machine learning techniques are used.
Although mining is an intriguing word to use, it is not a
good metaphor to describe the overall knowledge discovery
process [^3) and what people really do in the field jfil]].
Web mining is often associated with IR or IE. However, Web
mining or information discovery on the Web is not the same
as IR or IE.

2.1.1 Web Mining and Information Retrieval 

Some have claimed that resource or document discovery (IR)
on the Web is an instance of Web (content) mining, and
others associate Web mining with intelligent IR. Actually,
IR is the automatic retrieval of all relevant documents
while at the same time retrieving as few of the
non-relevant ones as possible [119]. IR has the primary
goals of indexing text and searching for useful documents
in a collection, and nowadays research in IR includes
modeling, document classification and categorization, user
interfaces, data visualization, filtering, etc. The task
that can be considered to be an instance of Web mining is
Web document classification or categorization, which could
be used for indexing. Viewed in this respect, Web mining is
part of the (Web) IR process. However, we should note that
not all of the indexing tasks use data mining techniques.

2.1.2 Web Mining and Information Extraction 
IE has the goal of transforming a collection of documents,
usually with the help of an IR system, into information that
is more readily digested and analyzed. IE aims to extract
relevant facts from the documents while IR aims to select
relevant documents [99]. While IE is interested in the
structure or representation of a document, IR views the text
in a document just as a bag of unordered words [123]. Thus,
in general IE works at a finer granularity level than IR
does on the documents. However, the differences between the
two become blurred if the interest of IR is in extraction
[100] and when used in the context of vague forms of
information in which a full text IR system can provide some
IE features.

Building IE systems manually is not feasible and scalable
for a medium as dynamic and diverse as Web contents [92].
Due to this nature of the Web, most IE systems focus on
extracting from specific Web sites. Others use machine
learning or data mining techniques to learn the extraction
patterns or rules for Web documents semi-automatically or
automatically fr^| . Within this view, Web mining is part
of the (Web) IE process. Other views regarding the
relationship between (Web) IE and Web mining also exist.
The results of the IE process could be in the form of a
structured database or could be a compression or summary of
the original text or documents. For the former, one could
view IE as a kind of pre-processing stage in the Web
mining process, which is the step after the IR process and
before the data mining techniques are performed. In
a similar view, IE can also be used to improve the indexing
process, which is part of the IR process. Conversely, for
the latter, one could also argue that IE is an instance of
text or Web mining since the summary or the compressed form
of a document is a new form of information that did not
exist before. However, we advocate the view that Web mining
is used to improve Web IE (Web mining is part of IE).
There are basically two types of IE: IE from unstructured
texts and IE from semi-structured data j9^j] . There are
considerable differences between the IE systems that are
used for unstructured documents and those that are used for
semi-structured or even structured documents. IE tasks
from unstructured natural language texts (classical or
traditional IE tasks) typically use linguistic
pre-processing, ranging from rather basic to slightly
deeper, before performing data mining. Classical or
traditional IE research, with roots in the NLP community,
has been studied for quite a long time [123]. We could say
that the Advanced Research Projects Agency (ARPA) helped
create the field (classical IE) because the evaluations of
IE cannot be separated from the ARPA-sponsored Message
Understanding Conferences (MUCs) and the TIPSTER IE
project. MUCs and TIPSTER are competitive environments that
seek to improve IE and IR technologies. Classical IE
usually relies on linguistic pre-processing such as
syntactic analysis, semantic analysis, and discourse
analysis [111; 76]. Indeed, classical IE could be called a
core language technology [123].
With the increasing popularity of the Web, there is a need
for structural IE systems that extract information from
semi-structured documents. Structural IE research is
different from the classical one as it usually utilizes the
meta-information (e.g. HTML tags [111], simple syntactics
j7(J, or delimiters) that is available inside the
semi-structured data. Structural IE approaches that do not
use linguistic constraints are termed wrapper induction [Q.
Some of the structural IE systems are built manually using
a knowledge engineering approach; for examples see [^6| |s|
|5S|. However, more and more structural IE systems for the
Web are built (semi-)automatically using machine learning
techniques or other algorithms, as building the systems
manually is no longer appropriate ]7q |. Some examples are
JFt| ; p6| : ps] ; 111 . These systems are usually built by
using machine learning or data mining techniques, which
learn extraction rules from annotated corpora. For more
explanations and the categories of IE we point interested
readers to the following survey papers. For classical IE
and the issues of IE for unstructured texts we refer to
mil § in and for structural IE we refer to jLll| ; .

2.1.3 Web Mining and Machine Learning Applied on the Web

Web mining is not the same as learning from the Web or
machine learning techniques applied on the Web. On the
one hand, there are some applications of machine learning
applied on the Web that are not instances of Web mining.
An example of this is a machine learning technique that is
used to spider the Web efficiently for a specific topic
104 ; jsif and that emphasizes planning the best path to be
traversed next. On the other hand, there are some other
methods used for Web mining besides machine learning
methods. Some examples are some proprietary algorithms that
are used for mining the hubs and authorities [124],
DataGuides [ p5[ p6| and Web schema discovery [12C; B6j.
However, there is a close relationship between the two
research areas. Machine learning techniques support and
help Web mining as they could be applied to the processes
in Web mining. For example, recent research [90] shows that
applying machine learning techniques could improve the text
classification process compared to the traditional IR
techniques. In short, Web mining intersects with the
application of machine learning on the Web.

2.2 Web Mining Categories 

In this section we give an overview of each category. More
detailed explanations are given in the respective sections.
Similar to Madria, et al. jsijl and Borges and Levene [ fl5| ,
we categorize Web mining into three areas of interest based
on which part of the Web to mine: Web content mining,
Web structure mining, and Web usage mining. Web content
mining describes the discovery of useful information
from the Web contents/data/documents. However, what the
Web contents consist of could encompass a very broad
range of data. Previously the Internet consisted of
different types of services and data sources such as
Gopher, FTP, and Usenet. Now most of those data are either
ported to or accessible from the Web. It is mentioned in
[65] that in the last several years the growth in the
amount of government information has been tremendous. We
also know of the existence of Digital Libraries that are
accessible from the Web. We also see that many companies
are transforming their businesses and services
electronically. As a consequence, many of the company
databases that previously resided in legacy systems are
being ported to or made accessible from the Web. Thus
employees, partners, or even customers could access some of
the company databases directly from Web-based interfaces.
Another consequence of this transformation is the existence
of Web applications, so that users could access the
applications through Web interfaces. Many applications and
systems are being migrated to the Web, and many types of
applications are emerging in the Web environment. Of course
some of the Web content data are hidden data, which cannot
be indexed. These data are either generated dynamically as
a result of queries and reside in the DBMSs or are private.
In short, the Web already contains many kinds and types of
data.
Basically, the Web content consists of several types of data
such as textual, image, audio, video, metadata, as well as
hyperlinks. Recent research on mining multiple types of data
is termed multimedia data mining [ p.2q ] . Thus we could
consider multimedia data mining as an instance of Web
content mining. However, this line of research still
receives less attention than the research on the text or
hypertext contents |l28| ; p9[ . The Web content data
consist of unstructured data such as free texts,
semi-structured data such as HTML documents, and more
structured data such as data in tables or
database-generated HTML pages. However, much of the Web
content data is unstructured text data pp| : [1; 0. The
research around applying data mining techniques to
unstructured text is termed knowledge discovery in texts
(KDT) IS, or text data mining Ell, or text mining |Tl5 .
Hence we could consider text mining as an instance of Web
content mining. We discuss text mining further in the next
section. We could differentiate the research done in Web
content mining from two different points of view: the IR
and DB [30] views. The goal of Web content mining from the
IR view is mainly to assist or to improve information
finding or filtering for the users, usually based on either
inferred or solicited user profiles, while the goal of Web
content mining from the DB view is mainly to model the data
on the Web and to integrate them so that more sophisticated
queries than keyword-based search could be performed. These
viewpoints are further discussed in the next section.

Web structure mining tries to discover the model un- 
derlying the link structures of the Web. The model is based 
on the topology of the hyperlinks with or without the de- 
scription of the links. This model can be used to categorize 
Web pages and is useful to generate information such as 
the similarity and relationship between different Web sites. 
Web structure mining could be used to discover authority 
sites for the subjects (authorities) and overview sites for the 
subjects that point to many authorities (hubs). 
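
The mutual reinforcement between hubs and authorities can be illustrated with a short iterative computation over the link topology. The following is a minimal sketch of a simplified hubs-and-authorities iteration (in the style commonly known as HITS), not necessarily the algorithm used in the work surveyed here; the toy graph, the function name, and the fixed number of iterations are assumptions.

```python
import math


def hubs_and_authorities(links, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page is a good hub if it points to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so that the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical toy graph: two overview pages point to overlapping sites.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2", "a3"], "a1": [], "a2": [], "a3": []}
hub, auth = hubs_and_authorities(links)
best_hub = max(hub, key=hub.get)
```

Here `h2` obtains the highest hub score because it points to every authority, while `a3` scores lower than `a1` and `a2` because only one hub points to it.
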
Web usage mining [30] tries to make sense of the data
generated by Web surfers' sessions or behaviors. While
Web content and structure mining utilize the real or
primary data on the Web, Web usage mining mines the
secondary data derived from the interactions of the users
with the Web. The Web usage data includes the data from
Web server access logs, proxy server logs, browser logs,
user profiles, registration data, user sessions or
transactions, cookies, user queries, bookmark data, mouse
clicks and scrolls, and any other data resulting from
interactions. Table 1 gives an overview of the above Web
mining categories (the explanations are given in the
subsequent sections).
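
As an illustration of how secondary usage data might be prepared, the sketch below groups server log entries into user sessions. It is only a sketch under stated assumptions: the log is reduced to (user, timestamp, URL) triples, users are already identified, and a session ends after 30 minutes of inactivity. The timeout value and all names are assumptions and do not come from the work surveyed here.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed cut-off


def sessionize(log_entries, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) log entries into per-user sessions.

    A new session starts when the gap between two consecutive requests
    of the same user exceeds the timeout.
    """
    by_user = defaultdict(list)
    for user, timestamp, url in sorted(log_entries, key=lambda e: (e[0], e[1])):
        by_user[user].append((timestamp, url))

    sessions = []
    for user, requests in by_user.items():
        current = [requests[0][1]]
        for (prev_time, _), (time, url) in zip(requests, requests[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


# Hypothetical log entries: (user id, seconds since start, requested URL).
log = [
    ("u1", 0, "/index.html"),
    ("u1", 40, "/products.html"),
    ("u2", 60, "/index.html"),
    ("u1", 4000, "/index.html"),
]
sessions = sessionize(log)
```

The resulting sessions are the kind of transactions to which statistical methods or (modified) association rules could then be applied.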

However we should emphasize that the distinctions between 
the above categories are not clear-cut. Web content mining 
might utilize text and links and even the profiles that are 
either inferred or inputted by the users. User profiles are 
mostly used for the user modeling applications or personal 
assistants. The same is true for Web structure mining that 
could use the information about the links in addition to the 
link structures. Moreover we could infer the traversed links 
from the documents that were requested during the user 
session from the logs generated by the server. We could also 
characterize the categories above from the point of view of 
the scope of most of the work done in the respective areas: 
local scope spans an individual Web site while global scope 
spans the entire Web. The scope of the Web content mining 
from the IR view and Web structure mining is global while 
the scope of the Web content mining from the DB view and 
Web usage mining is local. However this characterization is 
not clear-cut either. 

In practice, the three Web mining tasks above could be
used in isolation or combined in an application, especially
Web content and structure mining since Web documents
might also contain links. For example, Chakrabarti
et al. p4| use as Web content the terms in a document's
link neighborhood and as Web structure the links from its
neighbors, to classify Web pages. Joachims et al. |j7| use
Web content and usage to build a software tour agent for
assisting users browsing a Web site.

2.3 Web Mining and the Agent Paradigm 

Table 2: The association between the categories of Web mining and the agent paradigm

Information filtering technology        Web mining category
--------------------------------------  ------------------------------
Content-based filters                   Content mining
Reputation-based filters                Structure (and content) mining
Collaborative or social-based filters   Usage mining
Event-based filters                     Usage mining
Hybrid filters                          Combination of the categories

Web mining is often viewed from or implemented within
an agent paradigm. Thus, Web mining has a close
relationship with software agents or intelligent agents.
Indeed some of these agents perform data mining tasks to
achieve their goals. According to Green, et al. [B7J there
are three sub-categories of software agents: user interface
agents, distributed agents, and mobile agents. The
sub-categories of software agents that are relevant for
data mining tasks are user interface agents and distributed
agents. User interface agents try to maximize the
productivity of the current user's interaction with the
system by adapting their behavior. The issue of
personalization abounds here. User interface agents that
can be classified into the Web mining agent category are
information retrieval agents, information filtering agents,
and personal assistant agents. Distributed agent technology
is concerned with problem solving by a group of agents, and
relevant agents in this category are distributed agents for
knowledge discovery or data mining (for example see Jro[ ).
There are two frequently used approaches for developing
intelligent agents that help users find and retrieve
relevant information from the Web jn|, namely content-based
and collaborative approaches. In the content-based approach,
the system searches for items that match the user
preferences, based on an analysis of the content. In the
collaborative approach, the system tries to find users with
similar interests in order to give them recommendations.
The system does this by analyzing the user profiles and
sessions or transactions. It assumes that if some users
rate an item high, then other users with similar interests
would also rate this item high. So this approach mainly
uses the usage data (user ratings). Viewed in this light,
we could categorize the content-based methods as Web
content mining and the collaborative approaches as Web
usage mining. However, collaborative approaches might also
use or be combined with the Web content.
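
The assumption behind the collaborative approach, that users with similar interests rate items similarly, can be sketched as follows. This is a minimal user-based example, assuming explicit positive ratings on a 1 to 5 scale and cosine similarity over co-rated items; the names, the similarity measure, and the toy ratings are assumptions rather than a description of any particular surveyed system.

```python
import math


def cosine(u, v):
    """Cosine similarity between two rating dicts over their common items."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {i: scores[i] / weights[i] for i in scores if weights[i] > 0}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


# Hypothetical ratings: ann's tastes match bob's, not cat's.
ratings = {
    "ann": {"pageA": 5, "pageB": 1},
    "bob": {"pageA": 5, "pageB": 1, "pageC": 5, "pageD": 1},
    "cat": {"pageA": 1, "pageB": 5, "pageC": 1, "pageD": 5},
}
suggestions = recommend(ratings, "ann")
```

Because only ratings are used, the example relies on usage data alone; a hybrid system could additionally compare the content of the pages.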

A similar view related to the Web mining categories above
also exists in the software agent community. Delgado JS^]
classifies user interface agents by the underlying
information filtering technology into content-based
filters, reputation-based filters, collaborative or
social-based filters, event-based filters, and hybrid
filters. In event-based filtering, the system tracks and
follows the events that are inferred from the surfing
habits of people on the Web. Some examples of those events
are saving a URL into a bookmark folder, mouse clicks and
scrolls, link traversal behavior, etc. We could make an
association between these agent-based categories and the
Web mining categories above. Table 2 shows the association.

3. WEB CONTENT MINING 

In this section we list some of the research in the respective
categories in separate tables. We should note that the lists
are by no means complete. The explanations of the methods
surveyed are beyond the scope of this paper. Interested
readers can consult the book by Mitchell [g8[ and the
respective papers for the explanation of the methods. We just
intend to give a taste of the variety of representations,
processes, methods, and applications that have been used.

Table 1: Web mining categories

                  Web Content Mining                                                  Web Structure Mining      Web Usage Mining
                  IR View                         DB View
----------------  ------------------------------  ----------------------------------  ------------------------  --------------------------------
View of Data      - Unstructured                  - Semi structured                   - Links structure         - Interactivity
                  - Semi structured               - Web site as DB

Main Data         - Text documents                - Hypertext documents               - Links structure         - Server logs
                  - Hypertext documents                                                                         - Browser logs

Representation    - Bag of words, n-grams         - Edge-labeled graph (OEM)          - Graph                   - Relational table
                  - Terms, phrases                - Relational                                                  - Graph
                  - Concepts or ontology
                  - Relational

Method            - TFIDF and variants            - Proprietary algorithms            - Proprietary algorithms  - Machine Learning
                  - Machine learning              - ILP                                                         - Statistical
                  - Statistical (including NLP)   - (Modified) association rules                                - (Modified) association rules

Application       - Categorization                - Finding frequent sub-structures   - Categorization          - Site construction, adaptation,
Categories        - Clustering                    - Web site schema discovery         - Clustering                and management
                  - Finding extraction rules                                                                    - Marketing
                  - Finding patterns in text                                                                    - User modeling
                  - User modeling

3.1 Information Retrieval View 

3.1.1 Information Retrieval View for Unstructured Documents

Table 3 summarizes some of the research done for
unstructured documents. By unstructured documents we mean
free texts such as news stories. Most of the research
in table 3 uses bag of words to represent unstructured
documents. The bag of words or vector representation ]l0€[ ]
takes single words found in the training corpus as
features. This representation ignores the sequence in which
the words occur and is based on the statistics about single
words in isolation. The features could be Boolean (a word
either occurs or does not occur in a document) or frequency
based (the frequency of the word in a document). Variations
of the feature selection include removing the case,
punctuation, infrequent words, and stop words. The features
could be reduced further by applying some other feature
selection techniques, such as information gain, mutual
information, cross entropy, or odds ratio (see 1911 for the
details). Other pre-processing includes Latent Semantic
Indexing (LSI) |^, which seeks to transform the original
document vectors to a lower dimensional space by analyzing
the correlational structure of terms in the document
collection such that similar documents that do not share
terms are placed in the same topic, and stemming, which
reduces words to their morphological roots. For example the
words "informing", "information", "informer", and
"informed" would be stemmed to their common root "inform",
and only the root is used as the feature instead of the
four original words. While those pre-processing variations
are useful for reducing feature set size, the generality of
their effectiveness over different domains for text
categorization tasks is doubted [105].
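
The bag of words representation and the simpler pre-processing variations can be sketched in a few lines. The code below is only an illustration: the stop word list is a small assumed sample, and the suffix rules are a toy stand-in for a real stemming algorithm, chosen so that the four example words above reduce to "inform".

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "was", "to", "in", "about"}  # illustrative
SUFFIXES = ("ation", "ing", "er", "ed")  # toy rules, not a real stemmer


def stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, boolean=False):
    """Remove case, punctuation, and stop words, stem, then count the terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(stem(t) for t in tokens if t not in STOP_WORDS)
    # Boolean features record occurrence only; otherwise keep the frequency.
    return {term: 1 for term in counts} if boolean else dict(counts)


doc = "The informer was informing the informed about information."
frequency_features = bag_of_words(doc)
boolean_features = bag_of_words(doc, boolean=True)
```

After pre-processing, the whole sentence collapses to the single feature "inform", with frequency 4 in the frequency-based variant and value 1 in the Boolean variant, which shows how strongly these steps reduce the feature set.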

Other feature representations are also possible, such as
using information about word positions in the document
[^t| y: [sof , using n-gram representation (word sequences
of length up to n) [64; 70] (for example "the morphological
roots" is a tri-gram), using phrases pi 107 125 1 such as
"the quick brown fox that run away", using document concept
categories [44], using terms [45] such as "annual interest
rate" or "Wall Street", using hypernyms (linguistic term
for the "is a" relationship - a dog is an animal, thus
"animal" is a hypernym of "dog") [107], or using named
entities [124] such as people's names, dates, email
addresses, locations, organizations, or URLs. The
relational representation (|27j pq] in table 3) that we
mean here is actually first order logic, a language that is
more expressive than propositional logic (for instance
see ). For example, in the bag of words representation
features are the frequencies of specific words; using a
relational representation one might use relationships
between different words and their positions, e.g. "word X
is to the left of word Y in the same sentence". Although
different types of representations have been used, there is
currently no study that shows clear advantages of some
representations over several domains for text
categorization tasks. Indeed, Scott and Matwin [107]
compare different representations (bag of words, phrase
based, and hypernym) but found no significant differences
in the performance of the different representations.
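
Two of the alternative representations can be sketched briefly: word n-grams and a relational "left of" feature. The sketch assumes that "to the left of" means "occurs anywhere earlier in the same sentence" and that sentences end at ".", "!" or "?"; both are simplifying assumptions, and the function names are hypothetical.

```python
import re
from itertools import combinations


def word_ngrams(text, n):
    """All word sequences of length 1..n, grouped by length."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        " ".join(tokens[i : i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    ]


def left_of_facts(text):
    """Relational facts left_of(x, y): x occurs before y in the same sentence."""
    facts = set()
    for sentence in re.split(r"[.!?]", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, j in combinations(range(len(tokens)), 2):
            facts.add(("left_of", tokens[i], tokens[j]))
    return facts


grams = word_ngrams("the morphological roots", 3)
facts = left_of_facts("Interest rates rose. Markets fell.")
```

The n-gram list contains the tri-gram "the morphological roots" alongside its shorter sub-sequences, while the relational facts never relate words from different sentences, which a plain bag of words cannot express.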

As we can see from table 3, the commonly used process is 1
- 2 - 3 - 4, while some others do not use any or use only a
minimal pre-processing step (process 1 - 3 - 4). The names
and explanations of the four steps are given in section
2.1 above. The use of text compression techniques [124] is
rather new for the text classification task. The
applications range from text classification or
categorization, event detection and tracking, and finding
extraction patterns or rules, to finding some interesting
patterns in the text documents. Event detection and
tracking problems are sub-topics of a broader initiative
called topic detection and tracking (TDT), which is a new
line of research related to research in information
retrieval and filtering ||. TDT is an initiative to
investigate the state of the art in finding and following
new events in a stream of broadcast news stories ||.
Recently the usage of the term text mining has been a
subject of controversy. There are at least two
controversies that we are aware of: one is regarding the
usage of the term "mining" itself |il] and the other one is
regarding the meaning of the word "knowledge" in knowledge
discovery from text (KDT) 0]. As far as we know, the term
text mining or KDT was first proposed by Feldman and Dagan
in [44]. They suggest structuring the text documents by
means of information extraction, text categorization, or
applying
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Table 3: An IR view on Web content mining for unstructured documents 



Author 


Document Rcprcscnta- 


Process 


Method 


Application 


Ahoncn, ct al. |^| 


Bag of words and word 
positions 


1-2-3-4 


Episode rules 


Finding keywords and 
key phrases 

- Discovering grammatical rules 
and collocations 


Billsus and Pazzani Jl^| 


Bag of words 


1-2-3-4 


- TFIDF 

— ^Naive B ayes 


Text classification 


Cohen 


Relational 


1-2-3-4 


- Propositional rule based sys- 
tem 

Inductive Logic Programming 


Text classification 


Dumais, ct al. p^l 


- Bag of words 

— Phrases 


l_2-3-4 


TFIDF 

- D e cis ion trees 

- Naive Bayes 

- Bayes nets 

- Support Vector Machines 


lext categorization 


Fcldman and Dagan |44| 


Concept categories 


1-2-3-4 


Relative entropy 


Finding patterns between con- 
cept distributions in textual 
data 


Fcldman, ct al. Q 


Terms 


1-2-3-4 


Association rules 


Finding patterns across terms 
in textual data 


Frank, ct al. |so|| 


Phrases and their posi- 
tions 


1-2-3-4 


Naive Bayes 


Extracting keyphrases from 
text documents 


Eccitag and McCallum 


Bag of words 


1-3-4 


Hidden Markov Models 


Learning extraction models 


Hofmann |[j2"| 


Bag of words 


1-2-3-4 


Unsupervised statistical 
method 


Hierarchical clustering 


Honkcla, ct al. Q 


Bag of words with n- 
grams 


1-2-3-4 


Self-Organizing Maps 


Text and document clustering 


Junker, ct al. pj| 


Relational 


1-2-3-4 


Inductive Logic Programming 


- Text categorization 

- Learning extraction rules 


Kargupta, et al. ||70| 




Bag of words with n- 
grams 


1-2-3-4 


Unsupervised hierarchical 
clustering 

- Decision trees 

- Statistical analysis 


Text classification and hierar- 
chical clustering 


Nahm and Mo 


mpy [ }A 


1 


Bag of words 


1-2-3-4 


Decision trees 


Predicting (words) relationship 


Nigam, ct al. 


57 | 






Bag of words 


1-3-4 


Maximum entropy 


Text classification 


Scott and Matwin 


107 




- Bag of words 

- Phrases 

- Hypcrnyms and syn- 
onyms 


1-2-3-4 


Rule based system 


Text classification 


Sodcrland Jll r 


-1 


Sentences, and clauses 


1-2-3-4 


Rule learning 


Learning extraction rules 


Weiss, ct al. |1 


nh 




Bag of words 


1-2-3-4 


Boosted decision trees 


Text categorization 


Wiener, ct al. 


L22 




Bag of words 


1-2-3-4 


- Neural Networks 

- Logistic Regression 


Text categorization 


Wittcn, ct alj- 






Named entity 


1-2-3-4 


Text compression 


Named entity classifier 


Yang, ct al. Ill 


•5J 


Bag of words and 
phrases 


1-2-3-4 


- Clustering algorithms 

- k-Nearest Neighbor 

- Decision tree 


Event detection and tracking 
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NLP techniques as pre-processing step before performing 
any kind of KDTs. The reason is mining on the unprepared 
docu ments does not provide effectively exploitable results 
102; [45| . Currently the term text mining has been used 
to d escri b e di fferent application s such as text categorization 



115 



121 1, text clustering 



lOj ], IE ||, empirical 
computational linguistic tasks |61||, exploratory data analy- 
sis [ pl| |, finding patterns in text databases finding 
sequent ial p atterns in texts [^l|; ^; 0], and association dis- 
covery |He| 0. So although some of the papers surveyed 
mention their application as text mining, we use less con- 
troversial names for the applications. 

3.1.2 Information Retrieval View for Semi-Structured 
Documents 

We can see from Table 4 that the process used in the works
surveyed is 1-2-3-4. We can also see that the works surveyed in
Table 4 use richer representations than the works surveyed in
Table 3. This is due to the additional structural (HTML and
hyperlink) information in the hypertext documents. In fact, all
of the works surveyed utilize the HTML structure inside the
documents, and some also utilize the hyperlink structure between
the documents, for document representation. The methods used
are common data mining methods. The applications range from
hypertext classification or categorization and clustering,
learning relations between Web documents, and learning
extraction patterns or rules, to finding patterns in
semi-structured data.
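
As an illustration of how the HTML and hyperlink structure can enrich a bag of words, the following is a minimal sketch using only the Python standard library. The class name `PageFeatures`, the choice of features (title words, body words, and links with their anchor text), and the sample page are our own assumptions and do not reproduce the representation of any particular surveyed work.

```python
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collect title words, body words, and outgoing hyperlinks
    together with their anchor text."""

    def __init__(self):
        super().__init__()
        self.title_words = []
        self.body_words = []
        self.links = []        # list of (href, anchor text) pairs
        self._in_title = False
        self._href = None      # href of the <a> element being read
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        words = data.lower().split()
        if self._in_title:
            self.title_words.extend(words)
        else:
            self.body_words.extend(words)
        if self._href is not None:
            self._anchor.extend(words)


# Hypothetical page used only for illustration.
page = ("<html><head><title>Web Mining</title></head><body>"
        "<p>A survey.</p><a href='http://example.org/kdd'>KDD home</a>"
        "</body></html>")
features = PageFeatures()
features.feed(page)
features.close()
```

Keeping title words, anchor text, and link targets separate lets a learner weight them differently, which is the kind of additional structural information that plain text documents do not offer.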

3.2 Database View 

As mentioned in [49], database techniques on the Web are related
to the problems of managing and querying the information on the
Web. [49] mentions three classes of tasks related to those
problems: modeling and querying the Web, information extraction
and integration, and Web site construction and restructuring.
Although the first two tasks are related to Web content mining
applications, not all the works there fall inside the scope of
Web content mining, because machine learning or data mining
techniques are absent from the process. Basically the DB view
tries to infer the structure of a Web site or to transform a Web
site into a database so that better information management and
querying on the Web become possible. As mentioned previously,
the DB view of Web content mining mainly tries to model the data
on the Web and to integrate them so that queries more
sophisticated than keyword-based search can be performed. This
can be achieved by finding the schema of Web documents, or by
building a Web warehouse, a Web knowledge base, or a virtual
database. The research done in this area mainly deals with
semi-structured data. From the database view, semi-structured
data often refers to data that has some structure but no rigid
schema.

From Table 5, we can see that the DB view uses representations
that differ from those of the IR view in Table 3 and Table 4.
The DB view mainly uses the Object Exchange Model (OEM), which
represents semi-structured data by a labeled graph. The data in
OEM is viewed as a graph, with objects as the vertices and
labels on the edges. Each object is identified by an object
identifier (oid) and has a value that is either atomic, such as
integer, string, gif, html, etc., or complex, in the form of a
set of object references denoted as a set of (label, oid) pairs.
All of the processes surveyed are 1-2-3-4. However, the process
used here typically starts from manually selected Web sites for
doing Web content mining instead of searching the whole Internet
for the specific resources. This is partly due to the
applications of the DB view, which are quite different from
those of the IR view (which are mostly classification tasks).
Steps 1 and 2 are typically done by site-specific wrappers or
parsers for hypertext documents.
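
The OEM description above can be sketched as a small data structure. This is only an illustration under our own assumptions: the class `OEMObject`, the helper `children`, the "&n" notation for oids, and the sample objects are hypothetical and are not taken from any OEM implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[int, str]
Complex = List[Tuple[str, str]]   # set of (label, oid) references


@dataclass
class OEMObject:
    oid: str
    value: Union[Atomic, Complex]

    def is_atomic(self) -> bool:
        return not isinstance(self.value, list)


def children(db: Dict[str, OEMObject], oid: str, label: str) -> List[OEMObject]:
    """Follow all edges carrying the given label out of object `oid`."""
    obj = db[oid]
    if obj.is_atomic():
        return []
    return [db[target] for (edge_label, target) in obj.value
            if edge_label == label]


# Hypothetical data: two "paper" objects with different sets of fields.
objects = [
    OEMObject("&1", [("paper", "&2"), ("paper", "&3")]),
    OEMObject("&2", [("title", "&4"), ("year", "&5")]),
    OEMObject("&3", [("title", "&6")]),
    OEMObject("&4", "Web Mining Research"),
    OEMObject("&5", 2000),
    OEMObject("&6", "Untitled draft"),
]
db = {obj.oid: obj for obj in objects}
titles = [t.value
          for p in children(db, "&1", "paper")
          for t in children(db, p.oid, "title")]
```

Note that the second paper has no "year" edge. Nothing in the graph forbids this, which is what "some structure but no rigid schema" means in practice.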

Most of the applications surveyed above deal with the task of
schema extraction or discovery, or with building DataGuides
[95]. Roughly speaking, a schema or DataGuide is a kind of
structural summary of semi-structured data. For practical
applications and for computational reasons, this summary is
often approximated. Some applications do not deal with the task
of finding the global schema but with the task of finding
frequent substructures (sub-schema) in semi-structured data.
Another application deals with the task of creating a
multi-layered database (MLDB) [127], in which each layer is
obtained by generalizations on lower layers, and uses a special
purpose query language for Web mining to extract knowledge from
the MLDB of Web documents. This is an example of the query
perspective of data mining. There has been some work on query
languages for semi-structured data and for the Web. However,
only the works by Zaiane, et al. [127] and Singh, et al. [109]
fall inside the scope of Web content mining.

Due to the different representations used in the DB view, most
of the methods used for data mining are also different, except
for the ILP methods, which can operate on relational or
graphical data. These differences arise partly because many
existing data mining techniques operate on flat data and are not
appropriate for relational or graphical data. [58; 127] use
proprietary algorithms for schema discovery and for the
construction of the MLDB, [69] uses a modified version of
association rules, and [116] uses an upgraded first order logic
version of association rules.

3.3 About Mining Multimedia Data 

We should note that we have not actually discussed the issue of
mining multimedia data on the Web. Although multimedia data has
been the major focus for many researchers [114] and many
techniques for multimedia IR and extraction have been proposed
(for example see [60]), multimedia data mining is still in its
infancy [128]. Uthurusamy [117], Shapiro et al. [101], and
Mitchell [89] assert that working towards a unifying framework
for representation, problem solving, and learning from
multimedia data is indeed a challenge. Fayyad et al. describe
mining the images of sky objects taken from satellite. Smyth,
et al. [110] describe mining images to identify small volcanoes
on Venus. More recent works are [128] in the application of Web
data warehousing and [55] in the application of a medical IR
system for mining the multimedia data on the Web. For a
definition and a short survey on multimedia data mining, we
refer to [128].



Table 4: An IR view on Web content mining for semi-structured documents

| Author | Document Representation | Process | Method | Application |
|---|---|---|---|---|
| Craven, et al. | Relational and ontology | 1-2-3-4 | Modified Naive Bayes; Inductive Logic Programming | Hypertext classification; learning Web page relation; learning extraction rules |
| Crimmins, et al. | Phrase, URLs, and meta information | 1-2-3-4 | Unsupervised and supervised classification algorithms | Hierarchical and graphical classification; clustering |
| Fürnkranz | Bag of words and hyperlinks information | 1-2-3-4 | Rule learning | Hypertext classification |
| Joachims, et al. | Bag of words and hyperlinks information | 1-2-3-4 | TFIDF; reinforcement learning | Hypertext prediction |
| Muslea, et al. | Bag of words, tags, and word positions | 1-2-3-4 | Rule learning | Learning extraction rules |
| Shavlik and Eliassi-Rad [108] | Localized bag of words, and relational | 1-2-3-4 | Neural networks with reinforcement learning | Hypertext (homepage) classification |
| Singh, et al. [109] | Concepts and named entity | 1-2-3-4 | Modified association rule; classification algorithm | Finding patterns in semi-structured texts |
| Soderland [111] | Sentences, phrases, and named entity | 1-2-3-4 | Rule learning | Learning extraction rules |

Table 5: Web content mining from a database view

| Author | Document Representation | Process | Method | Application |
|---|---|---|---|---|
| Goldman and Widom | OEM | 1-2-3-4 | Proprietary algorithms | Finding DataGuide in semi-structured data |
| Grumbach and Mecca [58] | Strings and relational | 1-2-3-4 | Proprietary algorithms | Finding schema in semi-structured data |
| Nestorov, et al. [95] | OEM | 1-2-3-4 | Proprietary algorithms | Finding type hierarchy in semi-structured data |
| Toivonen [116] | OEM | 1-2-3-4 | Upgraded association rules | Finding useful sub-structure in semi-structured data |
| Wang and Liu | OEM | 1-2-3-4 | Modified association rules | Finding frequent sub-structures in semi-structured data |
| Zaiane and Han [127] | Relational | 1-2-3-4 | Attribute-oriented induction | Multilevel databases |



4. WEB STRUCTURE MINING

Whereas in the database view of Web content mining we are
interested in the structure within Web documents (intra-document
structure), in Web structure mining we are interested in the
structure of the hyperlinks within the Web itself
(inter-document structure). This line of research is inspired by
the study of social networks and citation analysis [71]. With
social network analysis we could discover specific types of
pages (such as hubs, authorities, etc.) based on the incoming
and outgoing links. Web structure mining utilizes the hyperlink
structure of the Web to apply social network analysis to model
the underlying link structure of the Web itself. Research done
by Kautz et al. [71] utilizes the network analysis of people to
model the network of AI researchers. They use the named entity
data found in close proximity in any public Web pages such as
the hyperlinks from home pages, co-authorship and citation of
papers, exchange of information between individuals found in
net-news archives, and organization charts. In our framework,
their research could be classified as a combination of Web
structure and content mining.

Some algorithms have been proposed to model the Web topology,
such as HITS, PageRank [16], and improvements of HITS that add
content information to the link structure and use outlier
filtering. These models are mainly applied as a method to
calculate the quality rank or relevancy of each Web page. Some
examples are the Clever system and Google. Some other
applications of the models include Web page categorization [25]
and discovering micro communities on the Web [75].

More applications of Web structure mining in the context of a
Web warehouse are discussed by Madria, et al. [83]. These
include measuring the completeness of Web sites by measuring the
frequency of local links that reside on the same server,
measuring the replication of Web documents across the Web
warehouse, which helps in identifying mirrored sites for
example, and discovering the nature of the hierarchy of
hyperlinks in the Web sites of a particular domain to study how
the flow of information affects the design of the Web sites.
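
The hub and authority notions used by HITS can be sketched as a simple power iteration over the link graph. The following is a minimal illustration under our own assumptions (a fixed number of iterations, Euclidean normalization, a tiny hypothetical graph); it is not the implementation used in the Clever system or in any of the works cited above.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to.
    Returns (hub, authority) score dictionaries."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score of a page: sum of the authority scores it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical four-page Web: a and b both point to c and d; c points to d.
web = {"a": ["c", "d"], "b": ["c", "d"], "c": ["d"], "d": []}
hub, auth = hits(web)
```

In this toy graph page d receives the most links from good hubs and therefore gets the highest authority score, while a and b, which point to both c and d, are the best hubs.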



5. WEB USAGE MINING 

Web usage mining focuses on techniques that could predict user
behavior while the user interacts with the Web. As mentioned
before, the mined data in this category are the secondary data
on the Web that result from interactions. These data could range
very widely but generally we could classify them into the usage
data that reside in the Web clients, proxy servers and servers
[113]. The Web usage mining process could be classified into two
commonly used approaches. The first approach maps the usage data
of the Web server into relational tables before an adapted data
mining technique is performed. The second approach uses the log
data directly by utilizing special pre-processing techniques. As
is true for typical data mining applications, the issues of data
quality and pre-processing are also very important here. The
typical problem is distinguishing among unique users, server
sessions, episodes, etc. in the presence of caching and proxy
servers [113]. For the details and a comparison of some
pre-processing methods for Web usage data we refer to [31].
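
As an illustration of the pre-processing problem, the sketch below splits server log records into sessions. It rests on two assumptions of ours that are not prescribed by the text: a user is approximated by the pair (IP address, user agent), and a gap of more than 30 minutes between requests starts a new session. Both heuristics can fail in the presence of caching and proxy servers, which is the difficulty noted above.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed threshold


def sessionize(log, timeout=SESSION_TIMEOUT):
    """log: iterable of (ip, user_agent, timestamp_seconds, url) tuples.
    Returns a list of sessions, each a list of URLs in visiting order."""
    by_user = defaultdict(list)
    for ip, agent, ts, url in log:
        by_user[(ip, agent)].append((ts, url))
    sessions = []
    for key in sorted(by_user):
        requests = sorted(by_user[key])
        current = [requests[0][1]]
        for (prev_ts, _), (ts, url) in zip(requests, requests[1:]):
            if ts - prev_ts > timeout:
                # Long silence: close the current session, start a new one.
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


# Hypothetical log records (ip, user agent, time in seconds, url).
log = [
    ("10.0.0.1", "Mozilla", 0, "/index.html"),
    ("10.0.0.1", "Mozilla", 60, "/papers.html"),
    ("10.0.0.1", "Lynx", 90, "/index.html"),
    ("10.0.0.1", "Mozilla", 4000, "/index.html"),
]
sessions = sessionize(log)
```

Here the same IP address yields three sessions: one for the second user agent, and two for the first because of the long pause before its last request.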

In general, typical data mining methods (see for example [113])
could be used to mine the usage data after the data have been
pre-processed to the desired form. However, modifications of the
typical data mining methods are also used, such as composite
association rules, an extension of a traditional sequence
discovery algorithm (MIDAS [17]), and hypertext probabilistic
grammars. The Web usage data could also be represented with
graphs. Web usage mining often uses some background or domain
knowledge such as navigation templates, Web content, site
topology, concept hierarchies, and syntactic constraints [17].
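
As a sketch of a typical data mining method applied to pre-processed usage data, the following computes plain association rules between pairs of pages over a set of sessions. It is a deliberately simple illustration: it is not the composite association rule method or MIDAS mentioned above, and the thresholds, the function name `page_rules`, and the sample sessions are our own assumptions.

```python
from collections import Counter
from itertools import permutations


def page_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return rules (a, b, support, confidence), read as
    'sessions that contain page a also contain page b'."""
    n = len(sessions)
    page_sets = [set(s) for s in sessions]
    single = Counter(p for s in page_sets for p in s)
    pair = Counter(ab for s in page_sets
                   for ab in permutations(sorted(s), 2))
    rules = []
    for (a, b), count in sorted(pair.items()):
        support = count / n             # fraction of sessions with a and b
        confidence = count / single[a]  # fraction of a-sessions with b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical sessions, e.g. as produced by a sessionizing step.
sessions = [
    ["/index", "/papers", "/contact"],
    ["/index", "/papers"],
    ["/index", "/contact"],
    ["/papers", "/index"],
]
rules = page_rules(sessions)
```

A rule such as "/papers" implies "/index" with confidence 1.0 says that every session visiting the papers page also visited the index page, which is the kind of navigation pattern a site designer could exploit.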

The applications of Web usage mining could be classified into
two main categories: learning a user profile or user modeling in
adaptive interfaces (personalized) (for examples see [79]) and
learning user navigation patterns (impersonalized) (for examples
see [112]). Web users would be interested in, among others,
techniques that could learn their information needs and
preferences, which is user modeling possibly combined with Web
content mining. On the other hand, information providers would
be interested in, among others, techniques that could improve
the effectiveness of the information on their Web sites by
adapting the Web site design or by biasing the user's behavior
towards satisfying the goals of the site. In other words, they
are interested in learning user navigation patterns. The learned
knowledge could then be used for applications such as
personalization (at a Web site level), system improvement, site
modification, business intelligence, and usage characterization
(see [113] for the details). It is not our intention to give a
complete survey of Web usage mining research here. Interested
readers could consult the overview papers by Srivastava, et al.
[113], Spiliopoulou [112], and Masand and Spiliopoulou, and
Robert Cooley's Ph.D. thesis, for mining user patterns, and the
overview paper by Langley [79] for mining user profiles.



6. RELATED WORKS 

As far as we know, it was Etzioni who first coined the term Web
mining. Etzioni starts from the hypothesis that the information
on the Web is sufficiently structured and outlines the subtasks
of Web mining. His paper describes the Web mining processes.
There have been some works surveying data mining on the Web. The
first paper that we know of that noticed the confusion in Web
mining research is fjc)| . It gives a Web mining taxonomy,
restricted to Web content and Web usage mining, and a survey of
Web usage mining. It divides Web content mining into the
agent-based approach and the database approach. We use a similar
division but use the IR approach instead of the agent approach.
Later, in [113], they classify Web mining into three categories
that are similar to our categories. Compared to their paper, our
paper points out three confusions on the usage of the term Web
mining, identifies additional user-centered Web mining
processes, and provides new perspectives for the Web mining
categories. We use the Web mining categories suggested in [83]
and [113]. [15] proposes a new model for mining Web log data,
while [83] discusses the research issues of Web mining in the
context of a Web warehouse project.

Carbonell et al. give an overview of the workshop on learning
from text and the Web that is related to Web content (from the
IR view) and usage mining. They also give an outline of the
research directions in that area. Mladenic surveys the research
on text learning and related intelligent agents. She compares
two frequently used approaches for developing intelligent
agents, namely collaborative and content based. In our
categories, these would be Web content (from the IR view) and
usage mining. She also surveys research on machine learning
applied to text data, which is broader than but similar to our
discussion in section 3.1.1 about the IR view of Web content
mining from unstructured documents. Carbonell et al. review the
emerging research collaborations between the IR and machine
learning communities in a special issue of the Machine Learning
journal. They also indicate some fertile research areas for both
communities. Garofalakis et al. review some data mining
techniques and the algorithms for Web mining that specifically
take into account the hyperlink information. Chakrabarti
provides a survey of data mining for hypertext. His paper mainly
surveys the statistical techniques for Web content across the
continuum of supervised, semi-supervised and unsupervised
learning, and social network analysis techniques for Web
structure mining. Levy and Weld [82] wrote a survey in the
special issue of Artificial Intelligence on intelligent Internet
systems that we think describes a broader domain than Web
mining. Vaithyanathan [118] gives an overview of the papers in
the special issue of Artificial Intelligence Review on data
mining on the Internet. He mentions categories of Web mining
similar to ours, except for the database view of Web content
mining. Some other related works that we found recently in
special issues of some magazines are the following. Yang and
Pedersen edited a special issue on intelligent information
retrieval [126]. Filman and Pant edited a special issue on
searching the Internet [48].



7. CONCLUSIONS 

In this paper we survey the research in the area of Web mining.
We point out some confusions regarding the usage of the term Web
mining. We also suggest three Web mining categories and then
situate some of the research with respect to these categories.
We also explore the connection between the Web mining categories
and the related agent paradigm. For the survey, we use the
representation issues, the process, the learning algorithm, and
the application of the recent works as the criteria. The Web
presents new challenges to the traditional data mining
algorithms that work on flat data. We have seen that some of the
traditional data mining algorithms have been extended, or new
algorithms have been used, to work on the Web data.

An interesting direction of Web content mining is the recent
interest in information integration, which could be in the form
of a Web knowledge base [20; 29] or Web warehouse, or in the
form of a mediator. At least this is the area where the database
community and other research communities such as IR, AI, and
machine learning have met recently. Information integration was
mainly concerned with integrating various databases but has
changed its focus with the increasing popularity of the Web. The
same is also true for the research in IE, which could be thought
of as a mediator or wrapper in the information integration area.
Information integration also raises some other research
questions such as scaling up the number of Web sites that could
be integrated, wrapper maintenance, building and maintaining a
global schema, etc. (see also [76] for other issues). Topic
detection and tracking (TDT) is also a promising research area
for the IR and machine learning communities that raises, among
others, the temporal issue in the data. It would be interesting
if the learning algorithm could model this aspect accurately.
Some other promising research issues in the area of Web content
mining are discussed in [20]. Finally, another interesting fact
is that graph structures occur almost everywhere in Web mining
research. There are many opportunities for (existing or new)
machine learning algorithms that could work with this
representation or that could take advantage of the available
structures on the Web.